Abstract



Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine
learning optimization problems such as SVM, due to its strong theoretical guarantees. While the
closely related Dual Coordinate Ascent (DCA) method has been implemented in various software
packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic
Dual Coordinate Ascent (SDCA) showing that this class of methods enjoys strong theoretical guarantees
that are comparable to or better than those of SGD. This analysis justifies the effectiveness of SDCA for
practical applications.

1 Introduction 

We consider the following generic optimization problem associated with regularized loss minimization of
linear predictors: Let $x_1, \ldots, x_n$ be vectors in $\mathbb{R}^d$, let $\phi_1, \ldots, \phi_n$ be a sequence of scalar convex functions,
and let $\lambda > 0$ be a regularization parameter. Our goal is to solve $\min_{w \in \mathbb{R}^d} P(w)$ where

$$P(w) = \left[ \frac{1}{n} \sum_{i=1}^n \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2 \right]. \tag{1}$$



For example, given labels $y_1, \ldots, y_n$ in $\{\pm 1\}$, the SVM problem (with linear kernels and no bias term) is
obtained by setting $\phi_i(a) = \max\{0, 1 - y_i a\}$. Regularized logistic regression is obtained by setting
$\phi_i(a) = \log(1 + \exp(-y_i a))$. Regression problems also fall into the above. For example, ridge regression is obtained
by setting $\phi_i(a) = (a - y_i)^2$, regression with the absolute value is obtained by setting $\phi_i(a) = |a - y_i|$,
and support vector regression is obtained by setting $\phi_i(a) = \max\{0, |a - y_i| - \nu\}$, for some predefined
insensitivity parameter $\nu > 0$.
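The loss functions listed above can be written down directly. The following is a minimal sketch in plain Python; the function names are illustrative choices and do not come from the paper, and `a` stands for the prediction $w^\top x_i$.

```python
import math

# Scalar losses phi_i(a); a is the prediction w^T x_i, y is the label/target.
# Function names are hypothetical, chosen only for this illustration.

def hinge(a, y):
    """SVM loss: max{0, 1 - y*a}."""
    return max(0.0, 1.0 - y * a)

def logistic(a, y):
    """log(1 + exp(-y*a)), arranged to avoid overflow for large |a|."""
    z = -y * a
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def squared(a, y):
    """Ridge regression loss: (a - y)^2."""
    return (a - y) ** 2

def absolute(a, y):
    """Absolute-value regression loss: |a - y|."""
    return abs(a - y)

def eps_insensitive(a, y, nu):
    """Support vector regression loss: max{0, |a - y| - nu}."""
    return max(0.0, abs(a - y) - nu)
```

Each function takes a single scalar prediction, which mirrors the fact that every $\phi_i$ in the problem is a scalar convex function.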

Let $w^*$ be the optimum of (1). We say that a solution $w$ is $\epsilon_P$-sub-optimal if $P(w) - P(w^*) \le \epsilon_P$. We
analyze the runtime of optimization procedures as a function of the time required to find an $\epsilon_P$-sub-optimal
solution.

A simple approach for solving SVM is stochastic gradient descent (SGD). SGD finds an
$\epsilon_P$-sub-optimal solution in time $\tilde{O}(1/(\lambda \epsilon_P))$. This runtime does not depend on $n$ and therefore is favorable
when $n$ is very large. However, the SGD approach has several disadvantages. It does not have a clear
stopping criterion. It tends to be too aggressive at the beginning of the optimization process, especially
when $\lambda$ is very small. And while SGD reaches a moderate accuracy quite fast, its convergence becomes rather
slow when we are interested in more accurate solutions.




An alternative approach is dual coordinate ascent (DCA), which solves a dual problem of (1). Specifically,
for each $i$ let $\phi_i^* : \mathbb{R} \to \mathbb{R}$ be the convex conjugate of $\phi_i$, namely, $\phi_i^*(u) = \max_z (zu - \phi_i(z))$. The
dual problem is



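$$\max_{\alpha \in \mathbb{R}^n} D(\alpha) \quad \text{where} \quad D(\alpha) = \left[ \frac{1}{n} \sum_{i=1}^n -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i x_i \right\|^2 \right]. \tag{2}$$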



The dual objective in (2) has a different dual variable associated with each example in the training set. At
each iteration of DCA, the dual objective is optimized with respect to a single dual variable, while the rest
of the dual variables are kept intact.
If we define

$$w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i x_i, \tag{3}$$

then it is known that $w(\alpha^*) = w^*$, where $\alpha^*$ is an optimal solution of (2). It is also known that
$P(w^*) = D(\alpha^*)$, which immediately implies that for all $w$ and $\alpha$, we have $P(w) \ge D(\alpha)$, and hence the duality gap
defined as

$$P(w(\alpha)) - D(\alpha)$$

can be regarded as an upper bound of the primal sub-optimality $P(w(\alpha)) - P(w^*)$.
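To make the duality gap concrete, here is a small sketch for the hinge loss, for which the conjugate satisfies $-\phi_i^*(-\alpha_i) = \alpha_i y_i$ on the domain $\alpha_i y_i \in [0, 1]$ (as stated later in the paper). The helper names and the list-of-lists data layout are assumptions made for this illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def w_of_alpha(alpha, X, lam):
    """Primal vector w(alpha) = (1 / (lam * n)) * sum_i alpha_i * x_i."""
    n, d = len(X), len(X[0])
    return [sum(alpha[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]

def primal(w, X, y, lam):
    """P(w) with the hinge loss phi_i(a) = max(0, 1 - y_i * a)."""
    losses = [max(0.0, 1.0 - y_i * dot(w, x_i)) for x_i, y_i in zip(X, y)]
    return sum(losses) / len(X) + 0.5 * lam * dot(w, w)

def dual(alpha, X, y, lam):
    """D(alpha) with the hinge loss; -phi_i^*(-alpha_i) = alpha_i * y_i."""
    if any(not 0.0 <= a_i * y_i <= 1.0 for a_i, y_i in zip(alpha, y)):
        return float("-inf")  # alpha lies outside the domain of the conjugate
    w = w_of_alpha(alpha, X, lam)
    return sum(a_i * y_i for a_i, y_i in zip(alpha, y)) / len(X) - 0.5 * lam * dot(w, w)

def duality_gap(alpha, X, y, lam):
    """P(w(alpha)) - D(alpha), an upper bound on the primal sub-optimality."""
    return primal(w_of_alpha(alpha, X, lam), X, y, lam) - dual(alpha, X, y, lam)
```

Because the gap is computable from $\alpha$ alone, it can serve as a stopping criterion, which is one of the practical advantages the paper attributes to dual methods.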

We focus on a stochastic version of DCA, abbreviated by SDCA, in which at each round we choose 
which dual coordinate to optimize uniformly at random. The purpose of this paper is to develop theoretical 
understanding of the convergence of the duality gap for SDCA. 

We analyze SDCA either for $L$-Lipschitz loss functions or for $(1/\gamma)$-smooth loss functions, which are
defined as follows.

Definition 1. A function $\phi_i : \mathbb{R} \to \mathbb{R}$ is $L$-Lipschitz if for all $a, b \in \mathbb{R}$, we have

$$|\phi_i(a) - \phi_i(b)| \le L\,|a - b|.$$

A function $\phi_i : \mathbb{R} \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its derivative is $(1/\gamma)$-Lipschitz. An
equivalent condition is that for all $a, b \in \mathbb{R}$, we have

$$\phi_i(a) \le \phi_i(b) + \phi_i'(b)(a - b) + \frac{1}{2\gamma}(a - b)^2.$$

It is well-known that if $\phi_i(a)$ is $(1/\gamma)$-smooth, then $\phi_i^*(u)$ is $\gamma$ strongly convex: for all $u, v \in \mathbb{R}$ and
$s \in [0, 1]$:

$$-\phi_i^*(su + (1 - s)v) \ge -s\phi_i^*(u) - (1 - s)\phi_i^*(v) + \frac{\gamma s(1 - s)}{2}(u - v)^2.$$
Our main findings are: in order to achieve a duality gap of $\epsilon$,

• For $L$-Lipschitz loss functions, we obtain the rate of $\tilde{O}(n + L^2/(\lambda\epsilon))$.

• For $(1/\gamma)$-smooth loss functions, we obtain the rate of $\tilde{O}((n + 1/(\lambda\gamma)) \log(1/\epsilon))$.

• For loss functions which are almost everywhere smooth (such as the hinge-loss), we can obtain a rate
better than the above rate for Lipschitz loss. See Section 5 for a precise statement.

2 Related Work



DCA methods are related to decomposition methods. While several experiments have shown that
decomposition methods are inferior to SGD for large scale SVM, Hsieh et al. [3] recently argued that
SDCA outperforms the SGD approach in some regimes. For example, this occurs when we need relatively
high solution accuracy so that either SGD or SDCA has to be run for more than a few passes over the data.

However, our theoretical understanding of SDCA is not satisfying. Several authors proved
a linear convergence rate for solving SVM with DCA (not necessarily stochastic). The basic technique is
to adapt the linear convergence of coordinate ascent that was established by Luo and Tseng. The linear
convergence means that it achieves a rate of $(1 - \nu)^k$ after $k$ passes over the data, where $\nu > 0$. This
convergence result tells us that after an unspecified number of iterations, the algorithm converges faster to
the optimal solution than SGD.

However, there are two problems with this analysis. First, the linear convergence parameter, $\nu$, may
be very close to zero and the initial unspecified number of iterations might be very large. In fact, while
the result of Luo and Tseng does not explicitly specify $\nu$, an examination of their proof shows that $\nu$ is proportional to the
smallest nonzero eigenvalue of $X^\top X$, where $X$ is the $n \times d$ data matrix whose $i$-th row is the $i$-th data
point $x_i$. For example, if two data points $x_i \ne x_j$ become closer and closer, then $\nu \to 0$. This dependency
is problematic in the data laden domain, and we note that such a dependency does not occur in the analysis
of SGD.

Second, the analysis only deals with the sub-optimality of the dual objective, while our real goal is to
bound the sub-optimality of the primal objective. Given a dual solution $\alpha \in \mathbb{R}^n$ its corresponding primal
solution is $w(\alpha)$ (see (3)). The problem is that even if $\alpha$ is $\epsilon_D$-sub-optimal in the dual, for some small $\epsilon_D$,
the primal solution $w(\alpha)$ might be far from being optimal. For SVM, [4, Theorem 2] showed that in order
to obtain a primal $\epsilon_P$-sub-optimal solution, we need a dual $\epsilon_D$-sub-optimal solution with $\epsilon_D = O(\lambda \epsilon_P^2)$;
therefore a convergence result for the dual solution can only translate into a primal convergence result with
a worse convergence rate. Such a treatment is unsatisfactory, and this is what we will avoid in the current
paper.

Some analyses of stochastic coordinate ascent provide solutions to the first problem mentioned above.
For example, Collins et al. analyzed an exponentiated gradient dual coordinate ascent algorithm. The
algorithm analyzed there (exponentiated gradient) is different from the standard DCA algorithm which we
consider here, and the proof techniques are quite different. Consequently their results are not directly
comparable to the results we obtain in this paper. Nevertheless we note that for SVM, their analysis shows a
convergence rate of $O(n/\epsilon_D)$ in order to achieve $\epsilon_D$-sub-optimality (on the dual) while our analysis shows a
convergence of $O(n \log\log n + 1/(\lambda\epsilon))$ to achieve an $\epsilon$ duality gap; for logistic regression, their analysis shows
a convergence rate of $O(n^2 \log(1/\epsilon_D))$ in order to achieve $\epsilon_D$-sub-optimality on the dual while our analysis
shows a convergence of $O((n + 1/\lambda) \log(1/\epsilon))$ to achieve an $\epsilon$ duality gap.

In addition, earlier work, and later [9], analyzed randomized versions of coordinate descent for
unconstrained and constrained minimization of smooth convex functions. [3, Theorem 4] applied these results to
the dual SVM formulation. However, the resulting convergence rate is $O(n/\epsilon_D)$ which is, as mentioned
before, inferior to the results we obtain here. Furthermore, neither of these analyses can be applied to logistic
regression due to their reliance on the smoothness of the dual objective function, which is not satisfied for
the dual formulation of logistic regression. We shall also point out again that all of these bounds are for the
dual sub-optimality, while as mentioned before, we are interested in the primal sub-optimality.

In this paper we derive new bounds on the duality gap (hence, they also imply bounds on the primal
sub-optimality) of SDCA. These bounds are superior to earlier results, and our analysis only holds for
randomized (stochastic) dual coordinate ascent. As we will see from our experiments, randomization is
important in practice. In fact, the practical convergence behavior of (non-stochastic) cyclic dual coordinate
ascent (even with a random ordering of the data) can be slower than our theoretical bounds for SDCA, and
thus cyclic DCA is inferior to SDCA. In this regard, we note that some of the earlier analyses such as [7] can
be applied both to stochastic and to cyclic dual coordinate ascent methods with similar results. This means
that their analysis, which can be no better than the behavior of cyclic dual coordinate ascent, is inferior to
our analysis.

The following table summarizes our results in comparison to previous analyses. Note that for SDCA
with Lipschitz loss, we observe a faster practical convergence rate, which is explained by our refined
analysis in Section 5.



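Lipschitz loss

Algorithm     type of convergence     rate
SGD           primal                  $\tilde{O}(1/(\lambda\epsilon))$
online EG     dual                    $\tilde{O}(n/\epsilon)$
SDCA          primal-dual             $\tilde{O}(n + 1/(\lambda\epsilon))$ or faster

Smooth loss

Algorithm     type of convergence     rate
SGD           primal                  $\tilde{O}(1/(\lambda\epsilon))$
online EG     dual                    $\tilde{O}(n^2 \log(1/\epsilon))$
SDCA          primal-dual             $\tilde{O}((n + 1/\lambda) \log(1/\epsilon))$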



3 Basic Results 

The generic algorithm we analyze is as follows. 



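Procedure SDCA($\alpha^{(0)}$)

  Let $w^{(0)} = w(\alpha^{(0)})$
  Iterate: for $t = 1, 2, \ldots, T$:
    Randomly pick $i$
    Find $\Delta\alpha_i$ to maximize $-\phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - \frac{\lambda n}{2} \|w^{(t-1)} + (\lambda n)^{-1} \Delta\alpha_i x_i\|^2$
    $\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta\alpha_i e_i$
    $w^{(t)} \leftarrow w^{(t-1)} + (\lambda n)^{-1} \Delta\alpha_i x_i$
  Output (Averaging option):
    Let $\bar{\alpha} = \frac{1}{T - T_0} \sum_{i=T_0+1}^{T} \alpha^{(i-1)}$
    Let $\bar{w} = w(\bar{\alpha}) = \frac{1}{T - T_0} \sum_{i=T_0+1}^{T} w^{(i-1)}$
    return $\bar{w}$
  Output (Random option):
    Let $\bar{\alpha} = \alpha^{(t)}$ and $\bar{w} = w^{(t)}$ for some random $t \in \{T_0 + 1, \ldots, T\}$
    return $\bar{w}$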
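The procedure above can be sketched in a few lines for the hinge loss, using the closed-form coordinate step that the paper gives in Section 6. This is an illustrative sketch only: the function names, the fixed seed, the guard for a zero feature vector, and the choice to return the final iterate (rather than the averaging or random output options) are assumptions of this example.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge_step(alpha_i, x_i, y_i, w, lam, n):
    """Closed-form maximizer of the single-coordinate dual for the hinge loss."""
    sq_norm = dot(x_i, x_i)
    if sq_norm == 0.0:
        # Guard added for this sketch: with x_i = 0 the objective is alpha_i * y_i,
        # which is maximized at alpha_i * y_i = 1.
        return y_i - alpha_i
    target = (1.0 - dot(x_i, w) * y_i) / (sq_norm / (lam * n)) + alpha_i * y_i
    return y_i * max(0.0, min(1.0, target)) - alpha_i

def sdca_hinge(X, y, lam, T, seed=0):
    """Run T iterations of SDCA from alpha = 0; returns the last (alpha, w)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d                      # w(alpha) for alpha = 0
    for _ in range(T):
        i = rng.randrange(n)           # pick a dual coordinate uniformly at random
        delta = hinge_step(alpha[i], X[i], y[i], w, lam, n)
        alpha[i] += delta
        for j in range(d):             # keep w = w(alpha) in sync incrementally
            w[j] += delta * X[i][j] / (lam * n)
    return alpha, w
```

Maintaining `w` incrementally means each iteration costs time proportional to the number of features of the chosen example, independent of $n$.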





We analyze the algorithm based on different assumptions on the loss functions. To simplify the
statements of our theorems, we always assume the following:

1. For all $i$, $\|x_i\| \le 1$

2. For all $i$, $\phi_i(a) \ge 0$

3. For all $i$, $\phi_i(0) \le 1$

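Theorem 1. Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $L$-Lipschitz for all $i$. To obtain a
duality gap of $E[P(\bar{w}) - D(\bar{\alpha})] \le \epsilon_P$, it suffices to have a total number of iterations of

$$T \ge T_0 + n + \frac{4L^2}{\lambda\epsilon_P} \ge \max\left(0, \lceil n \log(0.5\,\lambda n L^{-2}) \rceil\right) + n + \frac{20L^2}{\lambda\epsilon_P}.$$

Moreover, when $t \ge T_0$, we have a dual sub-optimality bound of $E[D(\alpha^*) - D(\alpha^{(t)})] \le \epsilon_P/2$.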
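As a quick way to get a feel for the bound in Theorem 1, the right-hand side can be evaluated numerically. The sketch below assumes that $\log$ denotes the natural logarithm and rounds the last term up to an integer; the function name is hypothetical.

```python
import math

def theorem1_iterations(n, lam, L, eps_p):
    """Right-hand side of the Theorem 1 condition:
    max(0, ceil(n * log(0.5 * lam * n / L^2))) + n + 20 * L^2 / (lam * eps_p).
    Assumes natural log; the final term is rounded up (a choice of this sketch).
    """
    warmup = max(0, math.ceil(n * math.log(0.5 * lam * n / L ** 2)))
    return warmup + n + math.ceil(20 * L ** 2 / (lam * eps_p))
```

For example, with $n = 64$, $\lambda = 0.125$, $L = 1$ and $\epsilon_P = 0.25$ the three terms are 89, 64 and 640. When $\lambda n$ is small the logarithmic term is negative and is clipped to zero.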

Remark 1. If we choose the average version, we may simply take $T = 2T_0$. Moreover, we note that
Theorem 1 holds both for averaging and for choosing $w$ at random from $\{T_0 + 1, \ldots, T\}$. This means
that calculating the duality gap at a few random points would lead to the same type of guarantee with high
probability. This approach has the advantage over averaging, since it is easier to implement the stopping
condition (we simply check the duality gap at some random stopping points; this is in contrast to averaging,
in which we need to know $T$, $T_0$ in advance).

Remark 2. The above theorem applies to the hinge-loss function, $\phi_i(u) = \max\{0, 1 - y_i u\}$. However, for
the hinge-loss, the constant 4 in the first inequality can be replaced by 1 (this is because the domain of the
dual variables is positive, hence the constant 4 in the corresponding lemma can be replaced by 1). We therefore obtain the
bound:

$$T \ge T_0 + n + \frac{L^2}{\lambda\epsilon_P} \ge \max\left(0, \lceil n \log(0.5\,\lambda n L^{-2}) \rceil\right) + n + \frac{5L^2}{\lambda\epsilon_P}.$$

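Theorem 2. Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $(1/\gamma)$-smooth for all $i$. To obtain
an expected duality gap of $E[P(w^{(T)}) - D(\alpha^{(T)})] \le \epsilon_P$, it suffices to have a total number of iterations of

$$T \ge \left(n + \frac{1}{\lambda\gamma}\right) \log\left(\left(n + \frac{1}{\lambda\gamma}\right) \cdot \frac{1}{\epsilon_P}\right).$$

Moreover, to obtain an expected duality gap of $E[P(\bar{w}) - D(\bar{\alpha})] \le \epsilon_P$, it suffices to have a total number of
iterations of

$$T \ge \left(n + \frac{1}{\lambda\gamma}\right) \log\left(\left(n + \frac{1}{\lambda\gamma}\right) \cdot \frac{1}{(T - T_0)\,\epsilon_P}\right).$$

Remark 3. If we choose $T = 2T_0$, and assume that $T_0 \ge n + 1/(\lambda\gamma)$, then the second part of Theorem 2
implies a requirement of

$$T \ge \left(n + \frac{1}{\lambda\gamma}\right) \log\left(\frac{1}{\epsilon_P}\right),$$

which is slightly weaker than the first part of Theorem 2 when $\epsilon_P$ is relatively large.

Remark 4. The estimation error of the primal objective behaves like $\Theta(1/(\lambda n))$. Therefore, an interesting
regime is when $1/(\lambda n) = \Theta(\epsilon)$. In that case, the bound for both Lipschitz and smooth functions would be
$\tilde{O}(n)$. Another interesting regime is when we would like $\epsilon \ll 1/(\lambda n)$, but still $1/(\lambda n) = O(1)$. In that case, smooth
functions still yield the bound $\tilde{O}(n)$, but the dominating term for Lipschitz functions will be $1/(\lambda\epsilon)$.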

Remark 5. The runtime of SGD is $O(1/(\lambda\epsilon))$. This can be better than SDCA if $n \gg 1/(\lambda\epsilon)$. However, in that case,
SGD in fact only looks at $n' = O(1/(\lambda\epsilon))$ examples, so we can run SDCA on these $n'$ examples and obtain
basically the same rate. For smooth functions, SGD can be much worse than SDCA if $\epsilon \ll 1/(\lambda n)$.

4 Using SGD at the first epoch



From the convergence analysis, SDCA may not perform as well as SGD for the first few epochs (each epoch
means one pass over the data). The main reason is that SGD takes a larger step size than SDCA earlier
on, which helps its performance. It is thus natural to combine SGD and SDCA, where the first epoch is
performed using a modified stochastic gradient descent rule. We show that the expected dual sub-optimality
at the end of the first epoch is $\tilde{O}(1/(\lambda n))$. This result can be combined with SDCA to obtain a faster
convergence.

We first introduce convenient notation. Let $P_t$ denote the primal objective for the first $t$ examples,

$$P_t(w) = \left[ \frac{1}{t} \sum_{i=1}^t \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2 \right].$$

The corresponding dual objective is

$$D_t(\alpha) = \left[ \frac{1}{t} \sum_{i=1}^t -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda t} \sum_{i=1}^t \alpha_i x_i \right\|^2 \right].$$

Note that $P_n(w)$ is the primal objective given in (1) and that $D_n(\alpha)$ is the dual objective given in (2).

The following algorithm is a modification of SGD. The idea is to greedily decrease the dual
sub-optimality for problem $D_t(\cdot)$ at each step $t$. This is different from DCA, which works with $D_n(\cdot)$ at each
step $t$.



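Procedure Modified-SGD

  Initialize: $w^{(0)} = 0$
  Iterate: for $t = 1, 2, \ldots, n$:
    Find $\alpha_t$ to maximize $-\phi_t^*(-\alpha_t) - \frac{\lambda t}{2} \|w^{(t-1)} + (\lambda t)^{-1} \alpha_t x_t\|^2$
    Let $w^{(t)} = \frac{1}{\lambda t} \sum_{i=1}^t \alpha_i x_i$
  return $\alpha$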





We have the following result for the convergence of the dual objective:

Theorem 3. Assume that $\phi_i$ is $L$-Lipschitz for all $i$. In addition, assume that $(\phi_i, x_i)$ are iid samples from
the same distribution for all $i = 1, \ldots, n$. At the end of Procedure Modified-SGD, we have

$$E[D(\alpha^*) - D(\alpha)] \le \frac{2L^2 \log(en)}{\lambda n}.$$

Here the expectation is with respect to the random sampling of $\{(\phi_i, x_i) : i = 1, \ldots, n\}$.

Remark 6. When $\lambda$ is relatively large, the convergence rate in Theorem 3 for Modified-SGD is better than
what we can prove for SDCA. This is because Modified-SGD employs a larger step size at each step $t$ for
$D_t(\alpha)$ than the corresponding step size in SDCA for $D(\alpha)$. However, the proof requires us to assume that
$(\phi_i, x_i)$ are randomly drawn from a certain distribution, while this extra randomness assumption is not
needed for the convergence of SDCA.

Procedure SDCA with SGD Initialization

  Stage 1: call Procedure Modified-SGD and obtain $\alpha$
  Stage 2: call Procedure SDCA with parameter $\alpha^{(0)} = \alpha$

Theorem 4. Assume that $\phi_i$ is $L$-Lipschitz for all $i$. In addition, assume that $(\phi_i, x_i)$ are iid samples from
the same distribution for all $i = 1, \ldots, n$. Consider Procedure SDCA with SGD Initialization. To obtain a
duality gap of $E[P(\bar{w}) - D(\bar{\alpha})] \le \epsilon_P$ at Stage 2, it suffices to have a total number of SDCA iterations of

$$T \ge T_0 + n + \frac{4L^2}{\lambda\epsilon_P} \ge \lceil n \log(\log(en)) \rceil + n + \frac{20L^2}{\lambda\epsilon_P}.$$

Moreover, when $t \ge T_0$, we have a dual sub-optimality bound of $E[D(\alpha^*) - D(\alpha^{(t)})] \le \epsilon_P/2$.

Remark 7. For Lipschitz loss, ideally we would like to have a computational complexity of $O(n + L^2/(\lambda\epsilon_P))$.
Theorem 4 shows that SDCA with SGD at the first epoch can achieve no worse than $O(n \log(\log n) + L^2/(\lambda\epsilon_P))$,
which is very close to the ideal bound. When $\lambda$ is relatively large, this result is better than that of vanilla SDCA in Theorem 1,
which shows a complexity of $O(n \log(n) + L^2/(\lambda\epsilon_P))$. The difference is caused by small
step-sizes in the vanilla SDCA, and its negative effect can be observed in practice. That is, the vanilla SDCA
tends to have a slower convergence rate than SGD in the first few iterations when $\lambda$ is relatively large.

Remark 8. Similar to Remark 2, the constant 4 in Theorem 4 can be reduced to 1, and the constant 20 can
be reduced to 5.



5 Refined Analysis for Almost Smooth Loss 

Our analysis shows that for smooth loss, SDCA converges faster than SGD (linear versus sub-linear con- 
vergence). For non-smooth loss, the analysis does not show any advantage of SDCA over SGD. This does 
not explain the practical observation that SDCA converges faster than SGD asymptotically even for SVM. 
This section tries to refine the analysis for Lipschitz loss and shows potential advantage of SDCA over SGD 
asymptotically. 

We note that for SVM, Luo and Tseng's analysis shows linear convergence of the form
$(1 - \nu)^k$ for dual sub-optimality after $k$ passes over the data. However, as we mentioned, $\nu$ is proportional to the
smallest nonzero eigenvalue of the data Gram matrix $X^\top X$, and hence can be arbitrarily bad when two data
points $x_i \ne x_j$ become very close to each other. Our analysis uses a completely different argument that
avoids this dependency on the data Gram matrix.

The analysis relies on the following refined dual strong convexity condition. 

Definition 2. For each $i$, we define $\gamma_i(\cdot) \ge 0$ so that for all dual variables $a$ and $b$, and $u \in \partial\phi_i^*(-b)$, we
have

$$\phi_i^*(-a) - \phi_i^*(-b) + u(a - b) \ge \gamma_i(u)\,|a - b|^2. \tag{4}$$

For the SVM loss, we have $\phi_i(u) = \max(0, 1 - u y_i)$, and $\phi_i^*(-a) = -a y_i$, with $a y_i \in [0, 1]$ and
$y_i \in \{\pm 1\}$. It follows that

$$\phi_i^*(-a) - \phi_i^*(-b) + u(a - b) = (b - a) y_i + u(a - b) = |u y_i - 1|\,|a - b| \ge |u y_i - 1| \cdot |a - b|^2.$$

Therefore we may take $\gamma_i(u) = |u y_i - 1|$.

For the absolute deviation loss, we have $\phi_i(u) = |u - y_i|$, and $\phi_i^*(-a) = -a y_i$ with $a \in [-1, 1]$. It
follows that $\gamma_i(u) = |u - y_i|$.

Proposition 1. Under the assumption of (4), let $\gamma_i = \gamma_i(w^{*\top} x_i)$. Then we have the following dual strong
convexity inequality:

$$D(\alpha^*) - D(\alpha) \ge \frac{1}{n} \sum_{i=1}^n \gamma_i\,|\alpha_i - \alpha_i^*|^2 + \frac{\lambda}{2}(w - w^*)^\top (w - w^*). \tag{5}$$

Moreover, given $w \in \mathbb{R}^d$ and $-a_i \in \partial\phi_i(w^\top x_i)$, we have

$$|(w^* - w)^\top x_i| \ge \gamma_i\,|a_i - \alpha_i^*|. \tag{6}$$

For SVM, we can take $\gamma_i = |w^{*\top} x_i y_i - 1|$, and for the absolute deviation loss, we may take
$\gamma_i = |w^{*\top} x_i - y_i|$. Although some of the $\gamma_i$ can be close to zero, in practice most $\gamma_i$ will be away from zero, which
means $D(\alpha)$ is strongly convex at nearly all points. Under this assumption, we may establish a convergence
result for the dual sub-optimality.

Theorem 5. Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $L$-Lipschitz for all $i$ and that it satisfies
(5). Define $N(u) = \#\{i : \gamma_i < u\}$. To obtain a dual sub-optimality of $E[D(\alpha^*) - D(\alpha^{(t)})] \le \epsilon_D$, it suffices
to have a total number of iterations of

$$t \ge 2(n/s) \log(2/\epsilon_D),$$

where $s \in [0, 1]$ satisfies $\epsilon_D \ge 8L^2 (s/(\lambda n))\, N(s/(\lambda n))/n$.
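The quantities in Theorem 5 are simple to compute once an (approximate) optimum $w^*$ is available. The sketch below uses the SVM choice $\gamma_i = |w^{*\top} x_i y_i - 1|$ from the text; the function names are hypothetical, $\log$ is assumed to be the natural logarithm, and $s$ is restricted to $(0, 1]$ to avoid dividing by zero.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_gammas(w_star, X, y):
    """gamma_i = |w*^T x_i * y_i - 1| for the hinge loss."""
    return [abs(dot(w_star, x_i) * y_i - 1.0) for x_i, y_i in zip(X, y)]

def count_below(gammas, u):
    """N(u) = #{i : gamma_i < u}."""
    return sum(1 for g in gammas if g < u)

def theorem5_iterations(gammas, lam, L, eps_d, s):
    """Return 2 * (n / s) * log(2 / eps_d) if s satisfies the condition of
    Theorem 5, and None otherwise."""
    if not 0.0 < s <= 1.0:
        raise ValueError("s must lie in (0, 1]")
    n = len(gammas)
    u = s / (lam * n)
    if eps_d < 8 * L ** 2 * u * count_below(gammas, u) / n:
        return None  # this s is too large for the requested accuracy
    return 2 * (n / s) * math.log(2 / eps_d)
```

In use, one would scan over candidate values of $s$ and keep the largest one for which the function does not return `None`, since the bound shrinks as $s$ grows.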

Remark 9. If $N(s/(\lambda n))/n$ is small, then Theorem 5 is superior to Theorem 1 for the convergence of the
dual objective function. We consider three scenarios. The first scenario is when $s = 1$. If $N(1/(\lambda n))/n$
is small, and $\epsilon_D \ge 8L^2 (1/(\lambda n))\, N(1/(\lambda n))/n$, then the convergence is linear. The second scenario is when
there exists $s_0$ so that $N(s_0/(\lambda n)) = 0$ (for SVM, it means that $\lambda n\,|w^{*\top} x_i y_i - 1| \ge s_0$ for all $i$), and since
$\epsilon_D \ge 8L^2 (s_0/(\lambda n))\, N(s_0/(\lambda n))/n$, we again have a linear convergence of $(2n/s_0) \log(2/\epsilon_D)$. In the third
scenario, we assume that $N(s/(\lambda n))/n = O[(s/(\lambda n))^{\nu}]$ for some $\nu > 0$; we can take $\epsilon_D = O((s/(\lambda n))^{1+\nu})$
and obtain

$$t \ge O\left(\lambda^{-1} \epsilon_D^{-1/(1+\nu)} \log(2/\epsilon_D)\right).$$

The $\log(1/\epsilon_D)$ factor can be removed in this case with a slightly more complex analysis. This result is again
superior to Theorem 1 for dual convergence.

The following result shows fast convergence of the duality gap using Theorem 5.

Theorem 6. Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $L$-Lipschitz for all $i$ and that it satisfies
(5). Let $\rho \le 1$ be the largest eigenvalue of the matrix $n^{-1} \sum_{i=1}^n x_i x_i^\top$. Define $N(u) = \#\{i : \gamma_i < u\}$.
Assume that at time $T_0 \ge n$, we have dual sub-optimality of $E[D(\alpha^*) - D(\alpha^{(T_0)})] \le \epsilon_D$, and define

$$\tilde{\epsilon}_P = \inf_{\gamma > 0} \left[ \frac{N(\gamma)}{n}\, 4L^2 + \frac{2\epsilon_D}{\min(\gamma, \lambda\gamma^2/(2\rho))} \right],$$

then at time $T = 2T_0$, we have

$$E[P(\bar{w}) - D(\bar{\alpha})] \le \epsilon_D + \frac{\tilde{\epsilon}_P}{2\lambda T_0}.$$

If for some $\gamma$, $N(\gamma)/n$ is small, then Theorem 6 is superior to Theorem 1. In particular, if for some
$\gamma > 0$ we have $N(\gamma) = 0$, then

$$E[P(\bar{w}) - D(\bar{\alpha})] = O(\epsilon_D).$$

In this situation, the convergence rate for the duality gap is linear, as implied by the linear convergence of $\epsilon_D$ in
Theorem 5.



6 Examples 

We will specify the SDCA algorithms for a few common loss functions. For simplicity, we only specify the 
algorithms without SGD initialization. In practice, instead of complete randomization, we may also run in 
epochs, and each epoch employs a random permutation of the data. We call this variant SDCA-PERM. 



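Procedure SDCA-PERM($\alpha^{(0)}$)

  Let $w^{(0)} = w(\alpha^{(0)})$
  Let $t = 0$
  Iterate: for epoch $k = 1, 2, \ldots$
    Let $\{i_1, \ldots, i_n\}$ be a random permutation of $\{1, \ldots, n\}$
    Iterate: for $j = 1, 2, \ldots, n$:
      $t \leftarrow t + 1$
      $i = i_j$
      Find $\Delta\alpha_i$ to increase dual (*)
      $\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta\alpha_i e_i$
      $w^{(t)} \leftarrow w^{(t-1)} + (\lambda n)^{-1} \Delta\alpha_i x_i$
  Output (Averaging option):
    Let $\bar{\alpha} = \frac{1}{T - T_0} \sum_{i=T_0+1}^{T} \alpha^{(i-1)}$
    Let $\bar{w} = w(\bar{\alpha}) = \frac{1}{T - T_0} \sum_{i=T_0+1}^{T} w^{(i-1)}$
    return $\bar{w}$
  Output (Random option):
    Let $\bar{\alpha} = \alpha^{(t)}$ and $\bar{w} = w^{(t)}$ for some random $t \in \{T_0 + 1, \ldots, T\}$
    return $\bar{w}$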





Lipschitz loss 

Hinge loss is used in SVM. We have $\phi_i(u) = \max\{0, 1 - y_i u\}$ and $\phi_i^*(-a) = -a y_i$ with $a y_i \in [0, 1]$.
Absolute deviation loss is used in quantile regression. We have $\phi_i(u) = |u - y_i|$ and $\phi_i^*(-a) = -a y_i$ with
$a \in [-1, 1]$.

For the hinge loss, step (*) in Procedure SDCA-PERM has a closed form solution as

$$\Delta\alpha_i = y_i \max\left(0, \min\left(1, \frac{1 - x_i^\top w^{(t-1)} y_i}{\|x_i\|_2^2/(\lambda n)} + \alpha_i^{(t-1)} y_i\right)\right) - \alpha_i^{(t-1)}.$$

For absolute deviation loss, step (*) in Procedure SDCA-PERM has a closed form solution as

$$\Delta\alpha_i = \max\left(-1, \min\left(1, \frac{y_i - x_i^\top w^{(t-1)}}{\|x_i\|_2^2/(\lambda n)} + \alpha_i^{(t-1)}\right)\right) - \alpha_i^{(t-1)}.$$

Both hinge loss and absolute deviation loss are 1-Lipschitz. Therefore, we expect a convergence
behavior of no worse than

$$O\left(n \ln n + \frac{1}{\lambda\epsilon}\right)$$

without SGD initialization based on Theorem 1. The refined analysis in Section 5 suggests a rate that can
be significantly better, and this is confirmed by our empirical experiments.
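As an illustration of the closed-form step for the absolute deviation loss, the sketch below computes the step and also exposes the single-coordinate dual objective, so the step can be compared against a brute-force search. The function names are hypothetical, and the code assumes $x_i \ne 0$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def abs_dev_step(alpha_i, x_i, y_i, w, lam, n):
    """Closed-form step: unconstrained maximizer, with alpha_i + step clipped to [-1, 1].
    Assumes x_i is not the zero vector."""
    target = (y_i - dot(x_i, w)) / (dot(x_i, x_i) / (lam * n)) + alpha_i
    return max(-1.0, min(1.0, target)) - alpha_i

def coordinate_objective(step, alpha_i, x_i, y_i, w, lam, n):
    """-phi_i^*(-(alpha_i + step)) - (lam*n/2) * ||w + step * x_i / (lam*n)||^2,
    using -phi_i^*(-a) = a * y_i for a in [-1, 1]."""
    a = alpha_i + step
    if not -1.0 <= a <= 1.0:
        return float("-inf")
    w_new = [w_j + step * x_j / (lam * n) for w_j, x_j in zip(w, x_i)]
    return a * y_i - 0.5 * lam * n * dot(w_new, w_new)
```

Since the coordinate objective is a concave quadratic in the step on a bounded interval, clipping the unconstrained maximizer gives the exact constrained maximizer, which is what the comparison against a grid of candidate steps checks.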

Smooth loss 

Squared loss is used in ridge regression. We have $\phi_i(u) = (u - y_i)^2$, and $\phi_i^*(-a) = -a y_i + a^2/4$. Log
loss is used in logistic regression. We have $\phi_i(u) = \log(1 + \exp(-y_i u))$, and
$\phi_i^*(-a) = a y_i \ln(a y_i) + (1 - a y_i) \ln(1 - a y_i)$ with $a y_i \in [0, 1]$.

For squared loss, step (*) in Procedure SDCA-PERM has a closed form solution as

$$\Delta\alpha_i = \frac{y_i - x_i^\top w^{(t-1)} - 0.5\,\alpha_i^{(t-1)}}{0.5 + \|x_i\|_2^2/(\lambda n)}.$$
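The squared-loss step can be checked by differentiating the single-coordinate dual objective: with $\phi_i^*(-a) = -a y_i + a^2/4$, the derivative with respect to the step is $y_i - (\alpha_i + \Delta\alpha_i)/2 - x_i^\top w - \Delta\alpha_i \|x_i\|^2/(\lambda n)$, and setting it to zero yields the closed form. A minimal sketch, with hypothetical function names:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_step(alpha_i, x_i, y_i, w, lam, n):
    """Closed-form SDCA step for the squared loss phi_i(u) = (u - y_i)^2."""
    return (y_i - dot(x_i, w) - 0.5 * alpha_i) / (0.5 + dot(x_i, x_i) / (lam * n))

def coordinate_gradient(step, alpha_i, x_i, y_i, w, lam, n):
    """Derivative of the single-coordinate dual objective with respect to the step."""
    return y_i - 0.5 * (alpha_i + step) - dot(x_i, w) - step * dot(x_i, x_i) / (lam * n)
```

On the one-example problem $n = 1$, $x = 1$, $y = 1$, $\lambda = 1$, a single step from $\alpha = 0$ gives $\Delta\alpha = 2/3$, so $w = 2/3$, which is the minimizer of $(w - 1)^2 + w^2/2$.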

For log loss, step (*) in Procedure SDCA-PERM does not have a closed form solution. However, one
may start with the approximate solution,

$$\Delta\alpha_i = \frac{(1 + \exp(x_i^\top w^{(t-1)} y_i))^{-1} y_i - \alpha_i^{(t-1)}}{\max(1, 0.25 + \|x_i\|_2^2/(\lambda n))},$$

and further use several steps of Newton's update to get a more accurate solution.

Finally, we present a smooth variant of the hinge-loss, as defined below. Recall that the hinge loss
function (for positive labels) is $\phi(u) = \max\{0, 1 - u\}$ and we have $\phi^*(-a) = -a$ with $a \in [0, 1]$. Consider
adding to $\phi^*$ the term $\frac{\gamma}{2} a^2$, which yields the $\gamma$-strongly convex function

$$\tilde{\phi}_\gamma^*(a) = \phi^*(a) + \frac{\gamma}{2} a^2.$$

Then, its conjugate, which is defined below, is $(1/\gamma)$-smooth. We refer to it as the smoothed hinge-loss (for
positive labels):

$$\tilde{\phi}_\gamma(x) = \max_{a \in [-1, 0]} \left[ ax - a - \frac{\gamma}{2} a^2 \right] =
\begin{cases}
0 & x > 1 \\
1 - x - \gamma/2 & x < 1 - \gamma \\
\frac{1}{2\gamma}(1 - x)^2 & \text{otherwise.}
\end{cases} \tag{7}$$
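The piecewise form of the smoothed hinge-loss can be checked against its definition as a maximum over $a \in [-1, 0]$. The sketch below implements both; the function names, the grid resolution, and the special case for $\gamma = 0$ (which falls back to the plain hinge loss) are choices made for this illustration.

```python
def smoothed_hinge(x, gamma):
    """Piecewise closed form of the smoothed hinge-loss (positive labels)."""
    if gamma == 0.0:
        return max(0.0, 1.0 - x)       # gamma = 0 recovers the plain hinge loss
    if x > 1.0:
        return 0.0
    if x < 1.0 - gamma:
        return 1.0 - x - gamma / 2.0
    return (1.0 - x) ** 2 / (2.0 * gamma)

def smoothed_hinge_bruteforce(x, gamma, grid=1000):
    """max over a in [-1, 0] of a*x - a - (gamma/2)*a^2, evaluated on a uniform grid."""
    best = float("-inf")
    for k in range(grid + 1):
        a = -k / grid
        best = max(best, a * x - a - 0.5 * gamma * a * a)
    return best
```

The two branches meet continuously at $x = 1 - \gamma$, where both give $\gamma/2$, and at $x = 1$, where both give 0.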

For the smoothed hinge loss, step (*) in Procedure SDCA-PERM has a closed form solution as

$$\Delta\alpha_i = y_i \max\left(0, \min\left(1, \frac{1 - x_i^\top w^{(t-1)} y_i - \gamma\,\alpha_i^{(t-1)} y_i}{\|x_i\|_2^2/(\lambda n) + \gamma} + \alpha_i^{(t-1)} y_i\right)\right) - \alpha_i^{(t-1)}.$$

Both log loss and squared loss are 1-smooth. The smoothed hinge-loss is $(1/\gamma)$-smooth. Therefore we
expect a convergence behavior of no worse than

$$O\left(\left(n + \frac{1}{\lambda}\right) \ln\frac{1}{\epsilon}\right).$$

This is confirmed in our empirical experiments.

7 Experimental Results



In this section we demonstrate the tightness of our theory. All our experiments are performed with the
smooth variant of the hinge-loss defined in (7), where the value of $\gamma$ is taken from the set $\{0, 0.01, 0.1, 1\}$.
Note that for $\gamma = 0$ we obtain the vanilla non-smooth hinge-loss.

In the experiments, we use $\epsilon_D$ to denote the dual sub-optimality, and $\epsilon_P$ to denote the primal
sub-optimality (note that this is different from the notation in our analysis, which uses $\epsilon_P$ to denote the duality
gap). It follows that $\epsilon_D + \epsilon_P$ is the duality gap.

7.1 Data 

The experiments were performed on three large datasets with very different feature counts and sparsity,
which were kindly provided by Thorsten Joachims. The astro-ph dataset classifies abstracts of papers from
the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification
task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock
& Dean. The following table provides details of the dataset characteristics.

Dataset     Training Size     Testing Size     Features     Sparsity
astro-ph    29882             32487            99757        0.08%
CCAT        781265            23149            47236        0.16%
cov1        522911            58101            54           22.22%



7.2 Linear convergence for Smooth Hinge-loss 

Our first experiments are with $\tilde{\phi}_\gamma$ where we set $\gamma = 1$. The goal of the experiment is to show that the
convergence is indeed linear. We ran the SDCA algorithm for solving the regularized loss minimization problem
with different values of the regularization parameter $\lambda$. Figure 1 shows the results. Note that a logarithmic scale
is used for the vertical axis. Therefore, a straight line corresponds to linear convergence. We indeed observe
linear convergence for the duality gap.

7.3 Convergence for non-smooth Hinge-loss

Next we experiment with the original hinge loss, which is 1-Lipschitz but is not smooth. We again ran the
SDCA algorithm for solving the regularized loss minimization problem with different values of the
regularization parameter $\lambda$. Figure 2 shows the results. As expected, the overall convergence rate is slower than in the
case of a smoothed hinge-loss. However, it is also apparent that for large values of $\lambda$ a linear convergence is
still exhibited, as expected according to our refined analysis. The bounds plotted are based on Theorem 1,
and are slower than what we observe, as expected from the refined analysis in Section 5.

7.4 Effect of smoothness parameter 

We next show the effect of the smoothness parameter. Figure 3 shows the effect of the smoothness parameter
on the rate of convergence. As can be seen, the convergence becomes faster as the loss function becomes
smoother. However, the difference is more dominant when $\lambda$ decreases.

Figure 4 shows the effect of the smoothness parameter on the zero-one test error. It is noticeable that
the difference in zero-one test error for different values of $\gamma$ is not very significant, and varies from dataset to
dataset. Since in terms of runtime a larger value of $\gamma$ is preferable, this might suggest that the smooth versions
of the hinge-loss are preferable.

Figure 1: Experiments with the smoothed hinge-loss ($\gamma = 1$). The primal and dual sub-optimality, the
duality gap, and our bound are depicted as a function of the number of epochs, on the astro-ph (left), CCAT
(center) and cov1 (right) datasets. In all plots the horizontal axis is the number of iterations divided by
training set size (corresponding to the number of epochs through the data).

Figure 2: Experiments with the hinge-loss (non-smooth). The primal and dual sub-optimality, the duality
gap, and our bound are depicted as a function of the number of epochs, on the astro-ph (left), CCAT (center)
and cov1 (right) datasets. In all plots the horizontal axis is the number of iterations divided by training set
size (corresponding to the number of epochs through the data).



7.5 Cyclic vs. Stochastic vs. Random Permutation 

In Figure 5 we compare choosing dual variables at random with repetitions (as done in SDCA) vs. choosing
dual variables using a random permutation at each epoch (as done in SDCA-Perm) vs. choosing dual
variables in a fixed cyclic order (that was chosen once at random). As can be seen, a cyclic order does not
lead to linear convergence and yields an actual convergence rate much slower than the other methods and even
worse than our bound. As mentioned before, some of the earlier analyses such as [7] can be applied both to
stochastic and to cyclic dual coordinate ascent methods with similar results. This means that their analysis,
which can be no better than the behavior of cyclic dual coordinate ascent, is inferior to our analysis. Finally,
we also observe that SDCA-Perm is sometimes faster than SDCA.
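The three coordinate-selection schemes compared here differ only in how the sequence of indices is generated. A minimal sketch follows; the generator names are hypothetical and a `random.Random` instance is passed in so that runs are reproducible.

```python
import random

def sdca_order(n, epochs, rng):
    """SDCA: sample each index uniformly at random, with repetitions."""
    for _ in range(epochs * n):
        yield rng.randrange(n)

def sdca_perm_order(n, epochs, rng):
    """SDCA-Perm: draw a fresh random permutation at every epoch."""
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        yield from perm

def cyclic_order(n, epochs, rng):
    """Cyclic DCA: one random order, chosen once and reused in every epoch."""
    perm = list(range(n))
    rng.shuffle(perm)
    for _ in range(epochs):
        yield from perm
```

Any of these generators can drive the same coordinate update loop, so the comparison isolates the effect of the sampling scheme from everything else in the algorithm.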



7.6 Comparison to SGD 

We next compare SDCA to Stochastic Gradient Descent (SGD). One clear advantage of SDCA is the
availability of a clear stopping condition (by calculating the duality gap). In Figure 6 and Figure 7 we present the
primal sub-optimality of SDCA, SDCA-Perm, and SGD. As can be seen, SDCA converges faster than SGD
in most regimes. SGD can be better if both $\lambda$ is high and one performs a very small number of epochs. This
is in line with our theory of Section 4. However, SDCA quickly catches up.

8 Proofs 

For convenience, we list the following simple facts about primal and dual formulations, which will be used in
the proofs. For each $i$, we have

$$-\alpha_i^* \in \partial\phi_i(w^{*\top} x_i), \qquad w^{*\top} x_i \in \partial\phi_i^*(-\alpha_i^*),$$

and

$$w^* = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^* x_i.$$

8.1 Proof of Theorem 1

The key lemma is the following: 

Lemma 1. Assume that $\phi_i^*$ is $\gamma$-strongly-convex (where $\gamma$ can be zero). Then, for any iteration $t$ and any
$s \in [0, 1]$ we have

$$E[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \frac{s}{n}\, E[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \left(\frac{s}{n}\right)^2 \frac{G^{(t)}}{2\lambda},$$

where

$$G^{(t)} = \frac{1}{n} \sum_{i=1}^n \left( \|x_i\|^2 - \frac{\gamma(1 - s)\lambda n}{s} \right) E\left[(u_i^{(t-1)} - \alpha_i^{(t-1)})^2\right],$$

and $-u_i^{(t-1)} \in \partial\phi_i(x_i^\top w^{(t-1)})$.

Figure 4: Zero-one test error as a function of the value of $\gamma$.



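Proof. Since only the $i$'th element of $\alpha$ is updated, the improvement in the dual objective can be written as

$$n[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] = \underbrace{\left( -\phi_i^*(-\alpha_i^{(t)}) - \frac{\lambda n}{2} \|w^{(t)}\|^2 \right)}_{A} - \underbrace{\left( -\phi_i^*(-\alpha_i^{(t-1)}) - \frac{\lambda n}{2} \|w^{(t-1)}\|^2 \right)}_{B}.$$

By the definition of the update we have for all $s \in [0, 1]$ that

$$A = \max_{\Delta\alpha_i} -\phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - \frac{\lambda n}{2} \|w^{(t-1)} + (\lambda n)^{-1} \Delta\alpha_i x_i\|^2$$
$$\ge -\phi_i^*(-(\alpha_i^{(t-1)} + s(u_i^{(t-1)} - \alpha_i^{(t-1)}))) - \frac{\lambda n}{2} \|w^{(t-1)} + (\lambda n)^{-1} s (u_i^{(t-1)} - \alpha_i^{(t-1)}) x_i\|^2. \tag{8}$$

From now on, we omit the superscripts and subscripts. Since $\phi^*$ is $\gamma$-strongly convex, we have that

$$\phi^*(-(\alpha + s(u - \alpha))) = \phi^*(s(-u) + (1 - s)(-\alpha)) \le s\phi^*(-u) + (1 - s)\phi^*(-\alpha) - \frac{\gamma}{2} s(1 - s)(u - \alpha)^2. \tag{9}$$

Combining this with (8) and rearranging terms we obtain that

$$A \ge -s\phi^*(-u) - (1 - s)\phi^*(-\alpha) + \frac{\gamma}{2} s(1 - s)(u - \alpha)^2 - \frac{\lambda n}{2} \|w + (\lambda n)^{-1} s(u - \alpha) x\|^2$$
$$= -s\phi^*(-u) - (1 - s)\phi^*(-\alpha) + \frac{\gamma}{2} s(1 - s)(u - \alpha)^2 - \frac{\lambda n}{2} \|w\|^2 - s(u - \alpha) w^\top x - \frac{s^2 (u - \alpha)^2}{2\lambda n} \|x\|^2$$
$$= \underbrace{-s(\phi^*(-u) + u w^\top x)}_{s\phi(w^\top x)} + \underbrace{\left( -\phi^*(-\alpha) - \frac{\lambda n}{2} \|w\|^2 \right)}_{B} + \frac{s}{2} \left( \gamma(1 - s) - \frac{s\|x\|^2}{\lambda n} \right) (u - \alpha)^2 + s(\phi^*(-\alpha) + \alpha w^\top x),$$

where we used $-u \in \partial\phi(w^\top x)$, which yields $\phi^*(-u) = -u w^\top x - \phi(w^\top x)$. Therefore

$$A - B \ge s \left[ \phi(w^\top x) + \phi^*(-\alpha) + \alpha w^\top x + \left( \frac{\gamma(1 - s)}{2} - \frac{s\|x\|^2}{2\lambda n} \right) (u - \alpha)^2 \right]. \tag{10}$$

Next note that

$$P(w) - D(\alpha) = \frac{1}{n} \sum_{i=1}^n \phi_i(w^\top x_i) + \frac{\lambda}{2} w^\top w - \left( -\frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i) - \frac{\lambda}{2} w^\top w \right)$$
$$= \frac{1}{n} \sum_{i=1}^n \left( \phi_i(w^\top x_i) + \phi_i^*(-\alpha_i) + \alpha_i w^\top x_i \right).$$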

[Figure 5 plots: panels astro-ph, CCAT, and cov1; curves SDCA, DCA-Cyclic, SDCA-Perm, and Bound; horizontal axis: number of epochs; vertical axis: duality gap.] 

Figure 5: Comparing the duality gap achieved by choosing dual variables at random with repetitions 
(SDCA), choosing dual variables at random without repetitions (SDCA-Perm), or using a fixed cyclic order. 
In all cases, the duality gap is depicted as a function of the number of epochs for different values of $\lambda$. The 
loss function is the smooth hinge loss with $\gamma = 1$. 

Figure 6: Comparing the primal sub-optimality of SDCA and SGD for the smoothed hinge-loss ($\gamma = 1$). 
In all plots the horizontal axis is the number of iterations divided by training set size (corresponding to the 
number of epochs through the data). 

Figure 7: Comparing the primal sub-optimality of SDCA and SGD for the non-smooth hinge-loss ($\gamma = 0$). 
In all plots the horizontal axis is the number of iterations divided by training set size (corresponding to the 
number of epochs through the data). 
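Therefore, if we take expectation of (10) w.r.t. the choice of $i$ we obtain that 

$$\frac{1}{s}\, \mathbb{E}[A - B] \ge \mathbb{E}[P(w) - D(\alpha)] - \frac{s}{2\lambda n} \cdot \underbrace{\frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - \frac{\gamma(1-s)\lambda n}{s} \right) \mathbb{E}(u_i - \alpha_i)^2}_{= G^{(t)}}.$$

We have obtained that 

$$\frac{n}{s}\, \mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \frac{s\, G^{(t)}}{2\lambda n}. \qquad (11)$$

Multiplying both sides by $s/n$ concludes the proof of the lemma. □ 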

We also use the following simple lemma: 

Lemma 2. For all $\alpha$, $D(\alpha) \le P(w^*) \le P(0) \le 1$. In addition, $D(0) \ge 0$. 

Proof. The first inequality is by weak duality, the second is by the optimality of $w^*$, and the third by the 
assumption that $\phi_i(0) \le 1$. For the last inequality we use $-\phi_i^*(0) = -\max_z (0 - \phi_i(z)) = \min_z \phi_i(z) \ge 0$, 
which yields $D(0) \ge 0$. □ 

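Equipped with the above lemmas we are ready to prove Theorem 2. 

Proof of Theorem 2. The assumption that $\phi_i$ is $(1/\gamma)$-smooth implies that $\phi_i^*$ is $\gamma$-strongly-convex. We will 
apply Lemma 1 with $s = \frac{\lambda n \gamma}{1 + \lambda n \gamma} \in [0, 1]$. Recall that $\|x_i\| \le 1$. Therefore, the choice of $s$ implies that 
$\|x_i\|^2 - \frac{\gamma(1-s)\lambda n}{s} \le 0$, and hence $G^{(t)} \le 0$ for all $t$. This yields, 

$$\mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \frac{s}{n}\, \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})].$$

But since $\epsilon_D^{(t-1)} := D(\alpha^*) - D(\alpha^{(t-1)}) \le P(w^{(t-1)}) - D(\alpha^{(t-1)})$ and $D(\alpha^{(t)}) - D(\alpha^{(t-1)}) = \epsilon_D^{(t-1)} - \epsilon_D^{(t)}$, 
we obtain that 

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left(1 - \frac{s}{n}\right) \mathbb{E}[\epsilon_D^{(t-1)}] \le \left(1 - \frac{s}{n}\right)^t \mathbb{E}[\epsilon_D^{(0)}] \le \left(1 - \frac{s}{n}\right)^t \le \exp(-st/n) = \exp\left( -\frac{\lambda\gamma t}{1 + \lambda\gamma n} \right).$$

This would be smaller than $\epsilon_D$ if 

$$t \ge \left( n + \frac{1}{\lambda\gamma} \right) \log(1/\epsilon_D).$$

It implies that 

$$\mathbb{E}[P(w^{(t)}) - D(\alpha^{(t)})] \le \frac{n}{s}\, \mathbb{E}[\epsilon_D^{(t)} - \epsilon_D^{(t+1)}] \le \frac{n}{s}\, \mathbb{E}[\epsilon_D^{(t)}]. \qquad (12)$$

So, requiring $\epsilon_D^{(t)} \le \frac{s}{n} \epsilon_P$ we obtain a duality gap of at most $\epsilon_P$. This means that we should require 

$$t \ge \left( n + \frac{1}{\lambda\gamma} \right) \log\left( \left( n + \frac{1}{\lambda\gamma} \right) \cdot \frac{1}{\epsilon_P} \right),$$

which proves the first part of Theorem 2. 

Next, we sum (12) over $t = T_0, \ldots, T-1$ to obtain 

$$\mathbb{E}\left[ \frac{1}{T - T_0} \sum_{t=T_0}^{T-1} \left( P(w^{(t)}) - D(\alpha^{(t)}) \right) \right] \le \frac{n}{s(T - T_0)}\, \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})].$$

Now, if we choose $\bar{w}, \bar{\alpha}$ to be either the average vectors or a randomly chosen vector over $t \in \{T_0 + 1, \ldots, T\}$, 
then the above implies 

$$\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \frac{n}{s(T - T_0)}\, \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] \le \frac{n}{s(T - T_0)}\, \mathbb{E}[\epsilon_D^{(T_0)}].$$

It follows that in order to obtain a result of $\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \epsilon_P$, we only need to have 

$$\mathbb{E}[\epsilon_D^{(T_0)}] \le \frac{s(T - T_0)\epsilon_P}{n} = \frac{(T - T_0)\epsilon_P}{n + \frac{1}{\lambda\gamma}}.$$

This implies the second part of Theorem 2 and concludes the proof. □ 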
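
To make the first part of the bound concrete, the iteration count can be evaluated directly. The helper below is a small sketch; its name `sdca_smooth_iterations` and the example parameter values are illustrative assumptions, and it simply evaluates the sufficient condition $t \ge (n + 1/(\lambda\gamma)) \log((n + 1/(\lambda\gamma))/\epsilon_P)$ stated in the proof.

```python
import math


def sdca_smooth_iterations(n, lam, gamma, eps_p):
    """Number of SDCA iterations sufficient for an expected duality gap of
    at most eps_p when each loss is (1/gamma)-smooth and ||x_i|| <= 1."""
    # n / s for the choice s = lam*n*gamma / (1 + lam*n*gamma) used in the proof.
    rate = n + 1.0 / (lam * gamma)
    return math.ceil(rate * math.log(rate / eps_p))


if __name__ == "__main__":
    # Hypothetical setting: 1000 examples, lam = 1e-3, smoothed hinge with gamma = 1.
    print(sdca_smooth_iterations(1000, 1e-3, 1.0, 1e-3))
```

In this hypothetical setting the count is roughly 29 epochs' worth of iterations, and it grows only logarithmically as $\epsilon_P$ shrinks.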
8.2 Proof of Theorem 1 

Next, we turn to the case of Lipschitz loss functions. We rely on the following lemma. 

Lemma 3. Let $\phi : \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz function. Then, for any $\alpha$ s.t. $|\alpha| > L$ we have that 
$\phi^*(\alpha) = \infty$. 

Proof. Fix some $\alpha > L$. By definition of the conjugate we have 

$$\phi^*(\alpha) = \sup_x\, [\alpha x - \phi(x)]$$
$$\ge -\phi(0) + \sup_x\, [\alpha x - (\phi(x) - \phi(0))]$$
$$\ge -\phi(0) + \sup_x\, [\alpha x - L|x - 0|]$$
$$\ge -\phi(0) + \sup_{x > 0}\, (\alpha - L)\, x = \infty.$$

A similar argument holds for $\alpha < -L$. □ 
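
Lemma 3 can be illustrated numerically. The sketch below uses the hinge loss $\phi(x) = \max\{0, 1 - x\}$, which is 1-Lipschitz, and approximates the conjugate by a supremum over a bounded grid. The helper name `truncated_conjugate` and the grid parameters are assumptions made for this illustration. For $|\alpha| > L$ the truncated supremum keeps growing with the grid radius, while inside the domain it stays fixed.

```python
def hinge(x):
    """Hinge loss with label +1; it is 1-Lipschitz."""
    return max(0.0, 1.0 - x)


def truncated_conjugate(phi, a, radius, steps=2000):
    """Approximate sup_x [a*x - phi(x)] over a uniform grid on [-radius, radius]."""
    best = float("-inf")
    for k in range(steps + 1):
        x = -radius + 2.0 * radius * k / steps
        best = max(best, a * x - phi(x))
    return best


if __name__ == "__main__":
    for a in (-0.5, 1.5, -2.0):
        values = [truncated_conjugate(hinge, a, r) for r in (10.0, 100.0)]
        # a = -0.5 lies inside [-L, L]: the value does not change with the radius.
        # a = 1.5 and a = -2.0 lie outside: the value scales with the radius.
        print(a, values)
```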

A direct corollary of the above lemma is: 

Lemma 4. Suppose that for all $i$, $\phi_i$ is $L$-Lipschitz. Let $G^{(t)}$ be as defined in Lemma 1 (with $\gamma = 0$). Then, 
$G^{(t)} \le 4L^2$. 

Proof. Using Lemma 3 we know that $|\alpha_i^{(t-1)}| \le L$, and in addition by the relation of Lipschitz and 
sub-gradients we have $|u_i^{(t-1)}| \le L$. Thus, $(u_i^{(t-1)} - \alpha_i^{(t-1)})^2 \le 4L^2$, and the proof follows. □ 

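We are now ready to prove Theorem 1. 

Proof of Theorem 1. Let $G = \max_t G^{(t)}$ and note that by Lemma 4 we have $G \le 4L^2$. Lemma 1, with 
$\gamma = 0$, tells us that 

$$\mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \frac{s}{n}\, \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \left(\frac{s}{n}\right)^2 \frac{G}{2\lambda}, \qquad (13)$$

which implies that 

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left(1 - \frac{s}{n}\right) \mathbb{E}[\epsilon_D^{(t-1)}] + \left(\frac{s}{n}\right)^2 \frac{G}{2\lambda}.$$

We next show that the above yields 

$$\mathbb{E}[\epsilon_D^{(t)}] \le \frac{2G}{\lambda(2n + t - t_0)} \qquad (14)$$

for all $t \ge t_0 = \max(0, \lceil n \log(2\lambda n \epsilon_D^{(0)} / G) \rceil)$. Indeed, let us choose $s = 1$, then at $t = t_0$, we have 

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left(1 - \frac{1}{n}\right)^t \epsilon_D^{(0)} + \frac{G}{2\lambda n^2} \cdot \frac{1}{1 - (1 - 1/n)} \le e^{-t/n} \epsilon_D^{(0)} + \frac{G}{2\lambda n} \le \frac{G}{\lambda n}.$$

This implies that (14) holds at $t = t_0$. For $t > t_0$ we use an inductive argument. Suppose the claim holds 
for $t - 1$, therefore 

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left(1 - \frac{s}{n}\right) \mathbb{E}[\epsilon_D^{(t-1)}] + \left(\frac{s}{n}\right)^2 \frac{G}{2\lambda} \le \left(1 - \frac{s}{n}\right) \frac{2G}{\lambda(2n + t - 1 - t_0)} + \left(\frac{s}{n}\right)^2 \frac{G}{2\lambda}.$$

Choosing $s = 2n/(2n - t_0 + t - 1) \in [0, 1]$ yields 

$$\mathbb{E}[\epsilon_D^{(t)}] \le \left(1 - \frac{2}{2n - t_0 + t - 1}\right) \frac{2G}{\lambda(2n - t_0 + t - 1)} + \left(\frac{2}{2n - t_0 + t - 1}\right)^2 \frac{G}{2\lambda}$$
$$= \frac{2G}{\lambda(2n - t_0 + t - 1)} \left(1 - \frac{1}{2n - t_0 + t - 1}\right)$$
$$= \frac{2G}{\lambda(2n - t_0 + t - 1)} \cdot \frac{2n - t_0 + t - 2}{2n - t_0 + t - 1}$$
$$\le \frac{2G}{\lambda(2n - t_0 + t - 1)} \cdot \frac{2n - t_0 + t - 1}{2n - t_0 + t}$$
$$= \frac{2G}{\lambda(2n - t_0 + t)}.$$

This provides a bound on the dual sub-optimality. We next turn to bound the duality gap. Summing (13) 
over $t = T_0 + 1, \ldots, T$ and rearranging terms we obtain that 

$$\mathbb{E}\left[ \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} \left( P(w^{(t-1)}) - D(\alpha^{(t-1)}) \right) \right] \le \frac{n}{s(T - T_0)}\, \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] + \frac{sG}{2\lambda n}.$$

Now, if we choose $\bar{w}, \bar{\alpha}$ to be either the average vectors or a randomly chosen vector over $t \in \{T_0 + 1, \ldots, T\}$, 
then the above implies 

$$\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \frac{n}{s(T - T_0)}\, \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] + \frac{sG}{2\lambda n}.$$

If $T \ge n + T_0$ and $T_0 \ge t_0$, we can set $s = n/(T - T_0)$ and combining with (14) we obtain 

$$\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \mathbb{E}[D(\alpha^{(T)}) - D(\alpha^{(T_0)})] + \frac{G}{2\lambda(T - T_0)}$$
$$\le \mathbb{E}[D(\alpha^*) - D(\alpha^{(T_0)})] + \frac{G}{2\lambda(T - T_0)}$$
$$\le \frac{2G}{\lambda(2n - t_0 + T_0)} + \frac{G}{2\lambda(T - T_0)}.$$

A sufficient condition for the above to be smaller than $\epsilon_P$ is that $T_0 \ge \frac{4G}{\lambda\epsilon_P} - 2n + t_0$ and $T \ge T_0 + \frac{G}{\lambda\epsilon_P}$. It 
also implies that $\mathbb{E}[D(\alpha^*) - D(\alpha^{(T_0)})] \le \epsilon_P/2$. Since we also need $T_0 \ge t_0$ and $T - T_0 \ge n$, the overall 
number of required iterations can be 

$$T_0 \ge \max\{t_0,\; 4G/(\lambda\epsilon_P) - 2n + t_0\}, \qquad T - T_0 \ge \max\{n,\; G/(\lambda\epsilon_P)\}.$$

We conclude the proof by noticing that $\epsilon_D^{(0)} \le 1$ (Lemma 2), which implies that $t_0 \le \max(0, \lceil n \log(2\lambda n / G) \rceil)$. 

□ 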
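
The inductive argument can be checked numerically by iterating the worst case of the recursion, that is, treating the inequality that follows from (13) as an equality, and comparing the result with the bound (14). This is only a sanity-check sketch: the function name `dual_recursion_bound` and the parameter values in the usage example are illustrative assumptions.

```python
import math


def dual_recursion_bound(n, lam, G, eps0, steps):
    """Iterate eps_t = (1 - s/n) * eps_{t-1} + (s/n)^2 * G / (2*lam) with the
    step choices used in the proof, and pair each value with the bound (14).

    Returns a list of (t, eps_t, bound_t) for t = t0, ..., t0 + steps.
    """
    t0 = max(0, math.ceil(n * math.log(2.0 * lam * n * eps0 / G)))
    eps = eps0
    # Warm-up phase: s = 1 for the first t0 iterations.
    for _ in range(t0):
        eps = (1.0 - 1.0 / n) * eps + (1.0 / n) ** 2 * G / (2.0 * lam)
    history = [(t0, eps, 2.0 * G / (lam * 2.0 * n))]
    # Inductive phase: s = 2n / (2n - t0 + t - 1), which lies in [0, 1].
    for t in range(t0 + 1, t0 + steps + 1):
        s = 2.0 * n / (2.0 * n - t0 + t - 1)
        eps = (1.0 - s / n) * eps + (s / n) ** 2 * G / (2.0 * lam)
        history.append((t, eps, 2.0 * G / (lam * (2.0 * n + t - t0))))
    return history


if __name__ == "__main__":
    hist = dual_recursion_bound(n=100, lam=0.1, G=4.0, eps0=1.0, steps=500)
    print(all(e <= b for _, e, b in hist))
```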
8.3 Proof of Theorem 3 

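We assume that $(\phi_t, x_t)$ are randomly drawn from a distribution $\mathcal{D}$, and define the population optimizer 

$$w^*_{\mathcal{D}} = \arg\min_w P_{\mathcal{D}}(w), \qquad P_{\mathcal{D}}(w) = \mathbb{E}_{(\phi, x) \sim \mathcal{D}}\, \phi(w^\top x) + \frac{\lambda}{2} \|w\|^2.$$

By definition, we have $P(w^*) \le P(w^*_{\mathcal{D}})$ for any specific realization of $\{(\phi_t, x_t) : t = 1, \ldots, n\}$. Therefore 

$$\mathbb{E} P(w^*) \le \mathbb{E} P(w^*_{\mathcal{D}}) = \mathbb{E} P_{\mathcal{D}}(w^*_{\mathcal{D}}),$$

where the expectation is with respect to the choice of examples, and note that both $P(\cdot)$ and $w^*$ are sample 
dependent. 

After each step $t$, we let $\alpha^{(t)} = [\alpha_1, \ldots, \alpha_t]$, and let $-u \in \partial\phi_{t+1}(x_{t+1}^\top w^{(t)})$. We have, for all $t$, 

$$(t+1) D_{t+1}(\alpha^{(t+1)}) - t D_t(\alpha^{(t)}) = -\phi^*_{t+1}(-\alpha_{t+1}) - (t+1) \frac{\lambda}{2} \|w^{(t+1)}\|^2 + t \frac{\lambda}{2} \|w^{(t)}\|^2$$
$$\ge -\phi^*_{t+1}(-u) - \frac{1}{2\lambda(t+1)} \|\lambda t\, w^{(t)} + u\, x_{t+1}\|^2 + \frac{\lambda t}{2} \|w^{(t)}\|^2$$
$$= -\phi^*_{t+1}(-u) - \frac{t}{t+1}\, u\, x_{t+1}^\top w^{(t)} + \frac{\lambda}{2} \left( t - \frac{t^2}{t+1} \right) \|w^{(t)}\|^2 - \frac{u^2 \|x_{t+1}\|^2}{2(t+1)\lambda}$$
$$= \phi_{t+1}(x_{t+1}^\top w^{(t)}) + \left( 1 - \frac{t}{t+1} \right) u\, x_{t+1}^\top w^{(t)} + \frac{\lambda}{2} \cdot \frac{t}{t+1} \|w^{(t)}\|^2 - \frac{u^2 \|x_{t+1}\|^2}{2(t+1)\lambda}$$
$$= \phi_{t+1}(x_{t+1}^\top w^{(t)}) + \frac{\lambda}{2} \|w^{(t)}\|^2 - \frac{1}{2(t+1)\lambda} \|\lambda w^{(t)} - u\, x_{t+1}\|^2.$$

Next note that $\|\lambda w^{(t)} - u\, x_{t+1}\| \le 2L$ (where we used the triangle inequality, the definition of $w^{(t)}$, and 
Lemma 3). Therefore, 

$$(t+1) D_{t+1}(\alpha^{(t+1)}) - t D_t(\alpha^{(t)}) \ge \phi_{t+1}(x_{t+1}^\top w^{(t)}) + \frac{\lambda}{2} \|w^{(t)}\|^2 - \frac{2L^2}{(t+1)\lambda}.$$

Taking expectation with respect to the choice of the examples, and noting that the $(t+1)$'th example does not 
depend on $w^{(t)}$, we obtain 

$$\mathbb{E}[(t+1) D_{t+1}(\alpha^{(t+1)}) - t D_t(\alpha^{(t)})] \ge \mathbb{E}[P_{\mathcal{D}}(w^{(t)})] - \frac{2L^2}{(t+1)\lambda} \ge \mathbb{E}[P_{\mathcal{D}}(w^*_{\mathcal{D}})] - \frac{2L^2}{(t+1)\lambda}$$
$$\ge \mathbb{E}[P(w^*)] - \frac{2L^2}{(t+1)\lambda} = \mathbb{E}[D(\alpha^*)] - \frac{2L^2}{(t+1)\lambda}.$$

Using Lemma 2 we know that $D_t(\alpha^{(t)}) \ge 0$ for all $t$. Therefore, by summing the above over $t$ we obtain 
that 

$$\mathbb{E}[n D(\alpha^{(n)})] \ge n\, \mathbb{E}[D(\alpha^*)] - \frac{2L^2 \log(en)}{\lambda},$$

which yields 

$$\mathbb{E}[D(\alpha^*) - D(\alpha^{(n)})] \le \frac{2L^2 \log(en)}{\lambda n}.$$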
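
The $\log(en)$ factor comes from summing the $1/(t+1)$ terms over the $n$ steps. As a brief worked step (the integral comparison is a standard fact that is not spelled out in the proof itself):

```latex
\sum_{t=0}^{n-1} \frac{2L^2}{(t+1)\lambda}
  = \frac{2L^2}{\lambda} \left( 1 + \sum_{k=2}^{n} \frac{1}{k} \right)
  \le \frac{2L^2}{\lambda} \left( 1 + \int_{1}^{n} \frac{dx}{x} \right)
  = \frac{2L^2}{\lambda} \, (1 + \log n)
  = \frac{2L^2 \log(en)}{\lambda}.
```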

8.4 Proof of Theorem 4 

The proof is identical to the proof of Theorem 1. We just need to notice that at the end of the first stage, we 
have $\mathbb{E}\,\epsilon_D^{(0)} \le 2L^2 \log(en)/(\lambda n)$. It implies that $t_0 \le \max(0, \lceil n \log(2\lambda n \cdot 2L^2 \log(en)/(\lambda n G)) \rceil)$. 

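8.5 Proof of Proposition 1 

Consider any feasible dual variable $\alpha$ and the corresponding $w = w(\alpha)$. Since 

$$w = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i, \qquad w^* = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^* x_i,$$

we have 

$$\lambda (w - w^*)^\top w^* = \frac{1}{n} \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, w^{*\top} x_i.$$

Therefore 

$$D(\alpha^*) - D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \left[ \phi_i^*(-\alpha_i) - \phi_i^*(-\alpha_i^*) + (\alpha_i - \alpha_i^*)\, w^{*\top} x_i \right] + \frac{\lambda}{2} \left[ w^\top w - w^{*\top} w^* - 2(w - w^*)^\top w^* \right]$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left[ \phi_i^*(-\alpha_i) - \phi_i^*(-\alpha_i^*) + (\alpha_i - \alpha_i^*)\, w^{*\top} x_i \right] + \frac{\lambda}{2} (w - w^*)^\top (w - w^*).$$

Since $w^{*\top} x_i \in \partial\phi_i^*(-\alpha_i^*)$, we have 

$$\phi_i^*(-\alpha_i) - \phi_i^*(-\alpha_i^*) + (\alpha_i - \alpha_i^*)\, w^{*\top} x_i \ge \gamma_i (\alpha_i - \alpha_i^*)^2.$$

By combining the previous two displayed inequalities, we obtain the first desired bound. 

Next, we let $u = w^{*\top} x_i$, $v = w^\top x_i$. Since $-\alpha_i \in \partial\phi_i(v)$ and $-\alpha_i^* \in \partial\phi_i(u)$, it follows that 
$u \in \partial\phi_i^*(-\alpha_i^*)$ and $v \in \partial\phi_i^*(-\alpha_i)$. Therefore 

$$|u - v| \cdot |\alpha_i^* - \alpha_i| = \underbrace{\left[ \phi_i^*(-\alpha_i^*) - \phi_i^*(-\alpha_i) + v(\alpha_i^* - \alpha_i) \right]}_{\ge 0} + \underbrace{\left[ \phi_i^*(-\alpha_i) - \phi_i^*(-\alpha_i^*) + u(\alpha_i - \alpha_i^*) \right]}_{\ge 0}$$
$$\ge \phi_i^*(-\alpha_i) - \phi_i^*(-\alpha_i^*) + u(\alpha_i - \alpha_i^*) \ge \gamma_i |\alpha_i - \alpha_i^*|^2.$$

This implies the second bound. 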
8.6 Proof of Theorem 5 

The following lemma is very similar to Lemma 1 with a nearly identical proof, but it focuses only on the 
convergence of the dual objective function using (5). 

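Lemma 5. Assume that (5) is valid. Then for any iteration $t$ and any $s \in [0, 1]$ we have 

$$\mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \ge \frac{s}{2n}\, \mathbb{E}[D(\alpha^*) - D(\alpha^{(t-1)})] + \frac{3 s \lambda}{4n}\, \mathbb{E}\|w^* - w^{(t-1)}\|^2 - \left(\frac{s}{n}\right)^2 \frac{G_*^{(t)}(s)}{2\lambda},$$

where 

$$G_*^{(t)}(s) = \frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - \frac{\gamma_i \lambda n}{s} \right) \mathbb{E}\big[(\alpha_i^* - \alpha_i^{(t-1)})^2\big].$$

Proof. Since only the $i$'th element of $\alpha$ is updated, the improvement in the dual objective can be written as 

$$n[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] = \underbrace{\left( -\phi_i^*(-\alpha_i^{(t)}) - \frac{\lambda n}{2} \|w^{(t)}\|^2 \right)}_{A_i} - \underbrace{\left( -\phi_i^*(-\alpha_i^{(t-1)}) - \frac{\lambda n}{2} \|w^{(t-1)}\|^2 \right)}_{B_i}.$$

By the definition of the update we have for all $s \in [0, 1]$ that 

$$A_i = \max_{\Delta\alpha_i} \; -\phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - \frac{\lambda n}{2} \|w^{(t-1)} + (\lambda n)^{-1} \Delta\alpha_i x_i\|^2$$
$$\ge -\phi_i^*(-(\alpha_i^{(t-1)} + s(\alpha_i^* - \alpha_i^{(t-1)}))) - \frac{\lambda n}{2} \|w^{(t-1)} + (\lambda n)^{-1} s(\alpha_i^* - \alpha_i^{(t-1)}) x_i\|^2.$$

We can now apply Jensen's inequality to obtain 

$$A_i \ge -s\phi_i^*(-\alpha_i^*) - (1-s)\phi_i^*(-\alpha_i^{(t-1)}) - \frac{\lambda n}{2} \|w^{(t-1)}\|^2 - s(\alpha_i^* - \alpha_i^{(t-1)})\, x_i^\top w^{(t-1)} - \frac{s^2 (\alpha_i^* - \alpha_i^{(t-1)})^2}{2\lambda n} \|x_i\|^2$$
$$= -s\left[ \phi_i^*(-\alpha_i^*) - \phi_i^*(-\alpha_i^{(t-1)}) \right] + B_i - s(\alpha_i^* - \alpha_i^{(t-1)})\, x_i^\top w^{(t-1)} - \frac{s^2 (\alpha_i^* - \alpha_i^{(t-1)})^2}{2\lambda n} \|x_i\|^2.$$

By summing over $i = 1, \ldots, n$, we obtain 

$$\sum_{i=1}^{n} [A_i - B_i] \ge -s \sum_{i=1}^{n} \left[ \phi_i^*(-\alpha_i^*) - \phi_i^*(-\alpha_i^{(t-1)}) \right] - s \sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{(t-1)})\, x_i^\top w^{(t-1)} - \frac{s^2}{2\lambda n} \sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{(t-1)})^2 \|x_i\|^2$$
$$= -s \sum_{i=1}^{n} \left[ \phi_i^*(-\alpha_i^*) - \phi_i^*(-\alpha_i^{(t-1)}) \right] - s \lambda n\, (w^* - w^{(t-1)})^\top w^{(t-1)} - \frac{s^2}{2\lambda n} \sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{(t-1)})^2 \|x_i\|^2,$$

where the equality follows from $\sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{(t-1)}) x_i = \lambda n (w^* - w^{(t-1)})$. By rearranging the terms on 
the right hand side using $(w^* - w^{(t-1)})^\top w^{(t-1)} = \|w^*\|^2/2 - \|w^{(t-1)}\|^2/2 - \|w^* - w^{(t-1)}\|^2/2$, we obtain 

$$\sum_{i=1}^{n} [A_i - B_i] \ge s n \left[ D(\alpha^*) - D(\alpha^{(t-1)}) \right] + \frac{s \lambda n}{2} \|w^* - w^{(t-1)}\|^2 - \frac{s^2}{2\lambda n} \sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{(t-1)})^2 \|x_i\|^2.$$

We can now apply (5) to obtain 

$$\sum_{i=1}^{n} [A_i - B_i] \ge \frac{s n}{2} \left[ D(\alpha^*) - D(\alpha^{(t-1)}) \right] + \frac{3 s \lambda n}{4} \|w^* - w^{(t-1)}\|^2 - \frac{s^2}{2\lambda n} \sum_{i=1}^{n} \left( \|x_i\|^2 - \frac{\gamma_i \lambda n}{s} \right) (\alpha_i^* - \alpha_i^{(t-1)})^2.$$

This implies the desired result. □ 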

Lemma 6. Suppose that for all $i$, $\phi_i$ is $L$-Lipschitz. Let $G_*^{(t)}(s)$ be as defined in Lemma 5. Then 

$$G_*^{(t)}(s) \le \frac{4L^2\, N(s/(\lambda n))}{n}.$$

Proof. Similarly to the proof of Lemma 4, we know that $(\alpha_i^* - \alpha_i^{(t-1)})^2 \le 4L^2$. Moreover, $\|x_i\|^2 \le 1$, and 
$\|x_i\|^2 - \frac{\gamma_i \lambda n}{s} \le 0$ when $\gamma_i \ge s/(\lambda n)$. Therefore there are no more than $N(s/(\lambda n))$ data points $i$ such that 
$\|x_i\|^2 - \frac{\gamma_i \lambda n}{s}$ is positive. The desired result follows from these facts. □ 

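Proof of Theorem 5. Let $\epsilon_D^{(t)} = \mathbb{E}[D(\alpha^*) - D(\alpha^{(t)})]$, and $G_*(s) = 4L^2 N(s/(\lambda n))/n$. We obtain from 
Lemma 5 and Lemma 6 that 

$$\epsilon_D^{(t)} \le (1 - s/(2n))\, \epsilon_D^{(t-1)} + \left(\frac{s}{n}\right)^2 \frac{G_*(s)}{2\lambda}.$$

It follows that for all $t > 0$ we have 

$$\epsilon_D^{(t)} \le (1 - s/(2n))^t\, \epsilon_D^{(0)} + \frac{1}{1 - (1 - s/(2n))} \left(\frac{s}{n}\right)^2 \frac{G_*(s)}{2\lambda}$$
$$\le e^{-st/(2n)} + \left(\frac{s}{n}\right) \frac{G_*(s)}{\lambda} \le e^{-st/(2n)} + \epsilon_D/2.$$

It follows that when 

$$t \ge (2n/s) \log(2/\epsilon_D),$$

we have $\epsilon_D^{(t)} \le \epsilon_D$. □ 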
8.7 Proof of Theorem 6 

Let $\epsilon_D^{(t)} = \mathbb{E}[D(\alpha^*) - D(\alpha^{(t)})]$. From Proposition 1 we know that for all $t \ge T_0$: 



i=l 



n 

n ^ 



ji E |af - a* | 2 + A E((«)W - w*) T ^) 2 
2 P 



where $-u_i^{(t-1)} \in \partial\phi_i(x_i^\top w^{(t-1)})$. It follows that given any $\gamma > 0$, we have 



ETC- I (*) (t-l)|2 in* I (*) 1) 1 2 , 2 r m 1 (*) *|2 , 10/ (t-1) *n2" 

E|a 4 w — u y >\ < supE|a 4 w — u\ \ H — ^ ^EjaJ — +E(« l - - * 



i=l 



a,- 



m I (t) (*-l)|2 , n 

< sup E a; — u • ' H — 

n 



i:7i>7 



2e 



(0 
1? 



n min(7, Xj 2 /(2p)) 



min(7, A7 2 /(2p)) 



where Lemma 4 is used for the last inequality. Since $\gamma$ is arbitrary and $\epsilon_D^{(t)} \le \epsilon_D$, it follows that 

n 

-YR\a®-uW\ 2 <e P . 



n 



i=l 



Plugging this into Lemma 1, we obtain for all $t \ge T_0 + 1$: 

2 1 n 

>£ E[ P(w(«))- D ( a <-))]-(£) JL_£e [( «! 

i=l 

>iE[P(«,(^))-D(a^))]-(^) 2 g. 
By taking $s = n/T_0$, and summing over $t = T_0 + 1, \ldots, 2T_0 = T$, we obtain 

e D > eg 50 - e { P > E[P(w) - D{a)\ 
This proves the desired bound. 



it-l) _ 



2AT ' 